
Fix realtime ingestion when an entire batch of messages is filtered out #7927

Merged 7 commits into apache:master from rt-ingestion-stuck on Dec 21, 2021

Conversation

richardstartin (Member):
eb6800da96a44e6f5125097cc99a368c2f8f8847 creates conditions where LLRealtimeSegmentDataManager cannot advance if a message batch consumed from Kafka is entirely filtered out, e.g. because it only contains tombstones. LLCRealtimeClusterIntegrationTest hangs on this commit.

01f9f16a18b188bd3e15408ad945eda9541caf75 allows offsets to be committed when the entire batch was filtered out. It also fixes some of the myriad warnings and concurrency bugs in LLRealtimeSegmentDataManager.
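
To make the failure mode concrete, here is a minimal, self-contained sketch; the class and method names are hypothetical stand-ins, not the real LLRealtimeSegmentDataManager or MessageBatch code. If the consuming loop only advances its offset while indexing messages, a batch whose contents were entirely filtered out leaves the offset untouched, so the same position is fetched over and over.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for the real stream classes; a sketch of the bug, not Pinot code.
public class StuckConsumerSketch {

  static final class Batch {
    final List<String> messages;   // messages that survived filtering
    final int unfilteredCount;     // messages actually read from the stream
    final long lastOffset;         // offset after the last read message

    Batch(List<String> messages, int unfilteredCount, long lastOffset) {
      this.messages = messages;
      this.unfilteredCount = unfilteredCount;
      this.lastOffset = lastOffset;
    }
  }

  // Simulates a fetch that reads 5 tombstones and filters all of them out.
  static Batch fetch(long fromOffset) {
    return new Batch(Collections.emptyList(), 5, fromOffset + 5);
  }

  public static void main(String[] args) {
    long currentOffset = 0;
    for (int round = 0; round < 3; round++) {
      Batch batch = fetch(currentOffset);
      for (String message : batch.messages) {
        // indexing would happen here; the offset only advances inside this loop
        currentOffset++;
      }
      // Every round fetches from offset 0 again: the consumer is stuck.
      System.out.println("round " + round + ", offset = " + currentOffset);
    }
  }
}
```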

} catch (PermanentConsumerException e) {
_segmentLogger.warn("Permanent exception from stream when fetching messages, stopping consumption", e);
throw e;
} catch (Exception e) {
// all exceptions but PermanentConsumerException are handled the same way
// can be a TimeoutException or TransientConsumerException routinely

richardstartin (Member Author):
Collapsed identical catch blocks for the sake of the person reading the code.
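
As an illustration of the pattern (the exception and method names below are hypothetical, not the real consumer code): the one unrecoverable exception is rethrown, and every other failure is handled by a single catch block instead of several identical ones.

```java
// Sketch of the collapsed-catch pattern with hypothetical types.
class PermanentConsumerException extends RuntimeException {
  PermanentConsumerException(String message) {
    super(message);
  }
}

class FetchLoopSketch {
  void fetchOnce() {
    try {
      fetchMessages();
    } catch (PermanentConsumerException e) {
      // Unrecoverable: stop consumption by propagating the exception.
      throw e;
    } catch (Exception e) {
      // Timeouts, transient consumer errors, etc. are routine: log once and
      // let the enclosing loop retry on its next iteration.
      System.err.println("Transient fetch failure, will retry: " + e);
    }
  }

  private void fetchMessages() {
    // Stand-in for the real stream consumer call.
  }
}
```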

Comment on lines 436 to 430
} else if (messageBatch.getUnfilteredMessageCount() > 0) {
// we consumed something from the stream but filtered all the content out,
// so we need to advance the offsets to avoid getting stuck
_currentOffset = messageBatch.getLastOffset();
lastUpdatedOffset = _streamPartitionMsgOffsetFactory.create(_currentOffset);

richardstartin (Member Author):
This is the bug fix: it ensures that we advance after consuming a bad batch.
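
A hedged sketch of the fixed control flow, using simplified, hypothetical names rather than the actual LLRealtimeSegmentDataManager fields: when a non-empty batch is filtered down to nothing, the offset jumps to the batch's last offset instead of staying put.

```java
import java.util.Collections;
import java.util.List;

// Simplified illustration of the fixed branch; names are hypothetical.
public class AdvanceOnFilteredBatchSketch {

  static long processBatch(List<String> filteredMessages, int unfilteredCount, long lastOffset,
      long currentOffset) {
    if (!filteredMessages.isEmpty()) {
      for (String message : filteredMessages) {
        index(message);
        currentOffset++;           // normal path: advance as each message is indexed
      }
    } else if (unfilteredCount > 0) {
      // The fix: the stream returned messages but all of them were filtered out,
      // so jump to the batch's last offset instead of re-reading the same range.
      currentOffset = lastOffset;
    }
    // A genuinely empty fetch leaves the offset unchanged.
    return currentOffset;
  }

  static void index(String message) {
    // Stand-in for indexing into the consuming segment.
  }

  public static void main(String[] args) {
    // A batch of 5 tombstones, all filtered out, ending at offset 42.
    System.out.println(processBatch(Collections.emptyList(), 5, 42L, 0L)); // prints 42
  }
}
```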

}
updateCurrentDocumentCountMetrics();
if (streamMessageCount != 0) {
_segmentLogger.debug("Indexed {} messages ({} messages read from stream) current offset {}", indexedMessageCount,
streamMessageCount, _currentOffset);
} else {
} else if (messagesAndOffsets.getUnfilteredMessageCount() == 0) {

richardstartin (Member Author):
Prevents unnecessary latency when there has been a bad batch; there is probably more data waiting to be consumed.
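
A sketch of the intent behind the changed condition (names are hypothetical): the idle back-off should only apply when the stream really returned nothing; after a batch that was read but fully filtered out, the loop should fetch again immediately.

```java
// Hypothetical illustration of when to back off between fetches.
public class IdleHandlingSketch {

  static long chooseSleepMillis(int indexedMessageCount, int unfilteredMessageCount, long idleSleepMillis) {
    if (indexedMessageCount > 0) {
      return 0L;                 // we indexed something; keep consuming
    } else if (unfilteredMessageCount == 0) {
      return idleSleepMillis;    // truly nothing on the stream: back off
    } else {
      // Messages were read but all filtered out; more data is likely waiting,
      // so do not add latency by sleeping.
      return 0L;
    }
  }

  public static void main(String[] args) {
    System.out.println(chooseSleepMillis(0, 5, 100L)); // 0: fetch again at once
    System.out.println(chooseSleepMillis(0, 0, 100L)); // 100: idle back-off
  }
}
```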

@@ -63,6 +63,11 @@
private final boolean _enableLeadControllerResource = RANDOM.nextBoolean();
private final long _startTime = System.currentTimeMillis();

@Override
protected boolean injectTombstones() {
return true;

richardstartin (Member Author):
Set this to false to make the test pass at eb6800da96a44e6f5125097cc99a368c2f8f8847
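
For context, a Kafka tombstone is simply a record with a null value. A test hook along these lines could inject them with the standard producer API; the class name, topic handling, and serializer choices here are assumptions for illustration, not the integration test's actual code.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

// Hedged sketch: publishes null-value records (tombstones) to a test topic.
public class TombstoneInjector {

  public static void injectTombstones(String bootstrapServers, String topic, int count) {
    Properties props = new Properties();
    props.put("bootstrap.servers", bootstrapServers);
    props.put("key.serializer", StringSerializer.class.getName());
    props.put("value.serializer", ByteArraySerializer.class.getName());
    try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
      for (int i = 0; i < count; i++) {
        // A null value is a tombstone: it marks the key for deletion on
        // compacted topics and is exactly the kind of record ingestion filters out.
        producer.send(new ProducerRecord<String, byte[]>(topic, "key-" + i, null));
      }
      producer.flush();
    }
  }
}
```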

throws IOException {
super.close();
List<ConsumerRecord<String, Bytes>> messageAndOffsets = consumerRecords.records(_topicPartition);
List<MessageAndOffset> filtered = new ArrayList<>(messageAndOffsets.size());

richardstartin (Member Author):
Note that a list was being materialised anyway in KafkaMessageBatch; it's just easier to do it here because we can also capture the last offset. This is likely more efficient than using Iterables.filter anyway.
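
The shape of that filtering pass, sketched against the standard kafka-clients types; the class and variable names (and treating a null value as the filter criterion) are illustrative assumptions, not the plugin's exact code.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.utils.Bytes;

// Hedged sketch: drop null-value records (tombstones) while remembering the
// last offset seen, so the caller can advance even if nothing survives.
public class FilterWithLastOffsetSketch {

  static final class Result {
    final List<byte[]> filtered;
    final long lastOffset;

    Result(List<byte[]> filtered, long lastOffset) {
      this.filtered = filtered;
      this.lastOffset = lastOffset;
    }
  }

  static Result filter(List<ConsumerRecord<String, Bytes>> records) {
    List<byte[]> filtered = new ArrayList<>(records.size());
    long lastOffset = -1L;
    for (ConsumerRecord<String, Bytes> record : records) {
      lastOffset = record.offset();   // captured whether or not the record is kept
      Bytes value = record.value();
      if (value != null) {            // a null value is a tombstone: skip it
        filtered.add(value.get());
      }
    }
    return new Result(filtered, lastOffset);
  }
}
```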

@@ -40,7 +40,7 @@
* versions of the stream implementation
*/
@InterfaceStability.Evolving
public interface StreamPartitionMsgOffset extends Comparable {
public interface StreamPartitionMsgOffset extends Comparable<StreamPartitionMsgOffset> {

richardstartin (Member Author):
The raw type caused a lot of warnings, which make the code harder to read; parameterising it also helps the compiler prevent heap pollution bugs.
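
A small illustration of the difference, using stripped-down stand-ins rather than the real StreamPartitionMsgOffset implementations: with the raw Comparable the compareTo parameter is effectively an Object and call sites produce unchecked warnings, whereas parameterising the interface makes the comparison type-checked at compile time.

```java
// Hedged sketch of the parameterised interface and one implementation.
interface StreamPartitionMsgOffsetSketch extends Comparable<StreamPartitionMsgOffsetSketch> {
}

final class LongOffsetSketch implements StreamPartitionMsgOffsetSketch {
  private final long _offset;

  LongOffsetSketch(long offset) {
    _offset = offset;
  }

  @Override
  public int compareTo(StreamPartitionMsgOffsetSketch other) {
    // The parameter is the interface type rather than Object, so callers get
    // compile-time checking and no unchecked warnings at the call site.
    return Long.compare(_offset, ((LongOffsetSketch) other)._offset);
  }
}
```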

* @return number of messages returned from the stream
*/
default int getUnfilteredMessageCount() {
return getMessageCount();

richardstartin (Member Author):
Implemented for Kafka 2.0 only.
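
A sketch of how the default method plus a single override behaves (hypothetical types, not the real MessageBatch SPI): implementations that never filter can ignore the new method, while a filtering batch reports how many records it read before filtering.

```java
import java.util.List;

// Hedged sketch of the default-method pattern.
interface MessageBatchSketch<T> {
  List<T> getMessages();

  default int getMessageCount() {
    return getMessages().size();
  }

  // Streams that never filter can rely on this default.
  default int getUnfilteredMessageCount() {
    return getMessageCount();
  }
}

final class FilteringBatchSketch implements MessageBatchSketch<byte[]> {
  private final List<byte[]> _filtered;
  private final int _unfilteredCount;

  FilteringBatchSketch(List<byte[]> filtered, int unfilteredCount) {
    _filtered = filtered;
    _unfilteredCount = unfilteredCount;
  }

  @Override
  public List<byte[]> getMessages() {
    return _filtered;
  }

  @Override
  public int getUnfilteredMessageCount() {
    // Reports what was actually read from the stream, even if everything
    // was filtered out before reaching the caller.
    return _unfilteredCount;
  }
}
```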

codecov-commenter commented Dec 18, 2021:
Codecov Report

Merging #7927 (29e5d20) into master (58e7f10) will increase coverage by 0.03%.
The diff coverage is 84.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #7927      +/-   ##
============================================
+ Coverage     71.08%   71.11%   +0.03%     
- Complexity     4109     4116       +7     
============================================
  Files          1593     1593              
  Lines         82373    82379       +6     
  Branches      12269    12274       +5     
============================================
+ Hits          58555    58586      +31     
+ Misses        19867    19852      -15     
+ Partials       3951     3941      -10     
Flag Coverage Δ
integration1 29.01% <84.00%> (+0.01%) ⬆️
integration2 27.64% <60.00%> (+0.07%) ⬆️
unittests1 68.11% <0.00%> (+0.05%) ⬆️
unittests2 14.33% <56.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...in/stream/kinesis/KinesisPartitionGroupOffset.java 17.39% <ø> (ø)
...ava/org/apache/pinot/spi/stream/LongMsgOffset.java 92.30% <ø> (ø)
...java/org/apache/pinot/spi/stream/MessageBatch.java 0.00% <0.00%> (ø)
...in/stream/kafka20/KafkaPartitionLevelConsumer.java 85.00% <84.61%> (-1.67%) ⬇️
...manager/realtime/LLRealtimeSegmentDataManager.java 73.35% <100.00%> (+1.53%) ⬆️
...pinot/plugin/stream/kafka20/KafkaMessageBatch.java 92.30% <100.00%> (+0.64%) ⬆️
...ller/helix/core/minion/TaskTypeMetricsUpdater.java 80.00% <0.00%> (-20.00%) ⬇️
...nction/DistinctCountBitmapAggregationFunction.java 47.66% <0.00%> (-9.85%) ⬇️
.../org/apache/pinot/core/startree/StarTreeUtils.java 69.89% <0.00%> (-2.16%) ⬇️
...e/pinot/segment/local/io/util/PinotDataBitSet.java 95.62% <0.00%> (-1.46%) ⬇️
... and 24 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 58e7f10...29e5d20.

Jackie-Jiang (Contributor) left a comment:
Good catch! Thanks for adding the test

richardstartin force-pushed the rt-ingestion-stuck branch 3 times, most recently from 3e3184b to 5214304 on December 20, 2021 at 21:16

Jackie-Jiang (Contributor) left a comment:
LGTM

sajjad-moradi (Contributor) left a comment:
LGTM, with a few minor suggestions.

sajjad-moradi (Contributor):
On a separate note, I saw that originally there were some optimizations piggy-backed onto this PR. We should avoid doing that. Each PR should only focus on one feature, one bug fix, etc. Any optimization or refactoring should go in a separate PR. That might take a little longer for the author, but it benefits all of us in the long run.
One example of this happened last week, when we spent a long time finding the root cause of the chunk index writer bug. The problematic optimization was added in a PR titled "Add MV raw forward index and MV BYTES data type". The table with the performance problem didn't have any multi-value columns or BYTES data type, so we couldn't easily tell whether that commit was related, and there were a lot of commits to examine. If we had a separate PR for that optimization, we could've found the root cause much more easily!

richardstartin (Member Author) commented Dec 21, 2021:
> On a separate note, I saw that originally there were some optimizations piggy-backed onto this PR. We should avoid doing that. Each PR should only focus on one feature, one bug fix, etc. Any optimization or refactoring should go in a separate PR. That might take a little longer for the author, but it benefits all of us in the long run.

There were no optimisations in this PR; there were some fixes for concurrency bugs (e.g. it is incorrect to use an increment operator on a volatile variable), so I'm not sure what you're referring to.
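
For readers unfamiliar with that class of bug, a minimal, self-contained demonstration (not Pinot code): ++ on a volatile field is a separate read, add, and write, so concurrent increments can be lost, whereas an AtomicLong performs the update atomically.

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal demonstration that `count++` on a volatile field is not atomic.
public class VolatileIncrementDemo {
  private static volatile long brokenCount = 0;           // lossy under contention
  private static final AtomicLong safeCount = new AtomicLong();

  public static void main(String[] args) throws InterruptedException {
    Runnable task = () -> {
      for (int i = 0; i < 100_000; i++) {
        brokenCount++;                 // read, add, write: increments can be lost
        safeCount.incrementAndGet();   // single atomic read-modify-write
      }
    };
    Thread t1 = new Thread(task);
    Thread t2 = new Thread(task);
    t1.start();
    t2.start();
    t1.join();
    t2.join();
    // brokenCount is typically well below 200000; safeCount is exactly 200000.
    System.out.println("volatile++ : " + brokenCount);
    System.out.println("AtomicLong : " + safeCount.get());
  }
}
```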

> One example of this happened last week, when we spent a long time finding the root cause of the chunk index writer bug. The problematic optimization was added in a PR titled "Add MV raw forward index and MV BYTES data type".

I'm not sure this is the best place to be discussing this, but you are implying that the change you mentioned was unrelated to the PR it was made in, when it wasn't. The purpose of the change, as has been discussed, is that the particular class creates very large buffers. It was included in that PR because its use for MV BYTES columns exacerbates this by multiplying what is already an overestimate of the buffer size by the maximum number of elements in the column. So the change was a mitigation for a worst case made worse by that PR.

> The table with the performance problem didn't have any multi-value columns or BYTES data type, so we couldn't easily tell whether that commit was related, and there were a lot of commits to examine. If we had a separate PR for that optimization, we could've found the root cause much more easily!

Looking at commits to figure out what changed isn't an efficient diagnostic technique. Had you instead used a profiler, you would have found that the process was spending a large amount of time in the syscalls mmap and munmap; looking at the git blame for each Pinot stack frame above those syscalls would have found the culprit in O(stack depth) time rather than O(lines of code changed).

Jackie-Jiang merged commit f8c7e1f into apache:master on Dec 21, 2021

sajjad-moradi (Contributor):
@richardstartin I don't want to get into details here. My point was that one PR should focus on one thing, not more!
